33 research outputs found

    Data augmentation and transfer learning to classify malware images in a deep learning context

    In the past few years, malware classification techniques have shifted from shallow traditional machine learning models to deeper neural network architectures. The main benefit of some of these architectures is the ability to work with raw data, enabled by their automatic feature extraction capabilities. As a result, building the models requires less technical expertise and fewer initial pre-processing resources. Nevertheless, this advantage comes with a drawback: deep learning models require huge quantities of data in order to generalize well. The amount of data required to train a deep network without overfitting is often unobtainable for malware analysts. We take inspiration from image-based data augmentation techniques and apply a sequence of semantics-preserving syntactic code transformations (obfuscations) to a small dataset of programs to generate a larger dataset. We then design two learning models, a convolutional neural network and a bi-directional long short-term memory network, and we train them on images extracted from compiled binaries of the newly generated dataset. Through transfer learning we then take the features learned from the obfuscated binaries and train the models against two state-of-the-art malware datasets, each containing around 10 000 samples. Our models achieve up to 98.5% accuracy on the test set, which is on par with or better than present state-of-the-art approaches, thus validating the approach.
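
The abstract above does not say how images are extracted from compiled binaries. One common convention, assumed here purely for illustration, is to lay out the raw bytes row by row as a grayscale image with one byte per pixel. The function name `bytes_to_image` and the default width of 16 are hypothetical choices, not taken from the paper.

```python
def bytes_to_image(data: bytes, width: int = 16) -> list[list[int]]:
    """Lay out raw bytes row by row as a grayscale image (0-255 per pixel).

    Assumption: one byte maps to one pixel; the paper's actual extraction
    procedure may differ.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    # Pad the final row with zeros so every row has exactly `width` pixels.
    padding = (-len(data)) % width
    padded = data + bytes(padding)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


# Stand-in for the contents of a compiled binary (hypothetical sample data).
sample = bytes(range(40))
image = bytes_to_image(sample, width=16)
print(len(image), "rows of", len(image[0]), "pixels")
```

A fixed width keeps every sample the same shape along one axis, which is what a convolutional model typically expects; the zero padding is likewise an assumption of this sketch.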

    A Semantics-Based Approach to Malware Detection

    Malware detection is a crucial aspect of software security. Current malware detectors work by checking for signatures, which attempt to capture the syntactic characteristics of the machine-level byte sequence of the malware. This reliance on a syntactic approach makes current detectors vulnerable to code obfuscations, increasingly used by malware writers, that alter the syntactic properties of the malware byte sequence without significantly affecting its execution behavior. This paper takes the position that the key to malware identification lies in its semantics. It proposes a semantics-based framework for reasoning about malware detectors and proving properties such as soundness and completeness of these detectors. Our approach uses a trace semantics to characterize the behavior of malware as well as that of the program being checked for infection, and uses abstract interpretation to "hide" irrelevant aspects of these behaviors. As a concrete application of our approach, we show that (1) standard signature matching detection schemes are generally sound but not complete, (2) the semantics-aware malware detector proposed by Christodorescu et al. is complete with respect to a number of common obfuscations used by malware writers, and (3) the malware detection scheme proposed by Kinder et al., based on standard model-checking techniques, is sound in general and complete on some, but not all, obfuscations handled by the semantics-aware malware detector.
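
A toy sketch of why purely syntactic signature matching can miss obfuscated variants: the detector below looks for a contiguous byte pattern, and a junk-insertion transformation breaks that pattern. The byte values, the function names, and the simplification that every byte is its own instruction are all assumptions made for illustration; this is not the formal framework the paper proposes.

```python
NOP = 0x90  # x86 no-operation opcode, used here as junk padding


def matches_signature(program: bytes, signature: bytes) -> bool:
    """Syntactic detector: flag the program if the signature appears verbatim."""
    return signature in program


def insert_nops(code: bytes) -> bytes:
    """Toy obfuscation: add a NOP after every byte.

    Assumption: each byte is treated as a complete one-byte instruction, so
    the padding would not change behavior. Real code has multi-byte
    instructions and would need instruction-aware insertion.
    """
    out = bytearray()
    for byte in code:
        out.append(byte)
        out.append(NOP)
    return bytes(out)


signature = b"\xde\xad\xbe\xef"               # hypothetical malware signature
original = b"\x01\x02" + signature + b"\x03"  # hypothetical infected program
obfuscated = insert_nops(original)

print(matches_signature(original, signature))
print(matches_signature(obfuscated, signature))
```

The detector flags only programs that really contain the pattern, yet it misses the padded variant, which is the kind of incompleteness with respect to obfuscations that the abstract describes.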

    Graceful Interruption of Request-Response Service Interactions

    Bi-directional request-response interaction is a standard communication pattern in Service Oriented Computing (SOC). Such a pattern should be interrupted in case of faults. In the literature, different approaches have been considered: WS-BPEL discards the response, while Jolie waits for it in order to allow the fault handler to appropriately close the conversation with the remote service. We investigate an intermediate approach in which the fault handler does not need to wait for the response, but it is still possible, on response arrival, to gracefully close the conversation with the remote service.

    Formal framework for reasoning about the precision of dynamic analysis

    Dynamic program analysis is extremely successful both in code debugging and in malicious code attacks. Fuzzing, concolic testing, and monkey testing are instances of the more general problem of analysing programs by dynamically executing their code with selected inputs. While static program analysis has a beautiful and well-established theoretical foundation in abstract interpretation, dynamic analysis still lacks such a foundation. In this paper, we introduce a formal model for understanding the notion of precision in dynamic program analysis. It is known that in sound-by-construction static program analysis, precision amounts to completeness. In dynamic analysis, which is inherently unsound, precision boils down to a notion of coverage of execution traces with respect to what the observer (attacker or debugger) can effectively observe about the computation. We introduce a topological characterisation of the notion of coverage relative to a given (fixed) observation for dynamic program analysis, and we show how this coverage can be changed by semantics-preserving code transformations. As in the case of static program analysis and abstract interpretation, for dynamic analysis too we can morph the precision of the analysis by transforming the code. In this context, we validate our model on well-established code obfuscation and watermarking techniques. We confirm the efficiency of existing methods for preventing control-flow-graph extraction and data exploit by dynamic analysis, including a validation of the potency of fully homomorphic data encodings in code obfuscation.

    Choreographies in Practice

    Choreographic Programming is a development methodology for concurrent software that guarantees correctness by construction. The key to this paradigm is to disallow mismatched I/O operations in programs, called choreographies, and then mechanically synthesise distributed implementations in terms of standard process models via a mechanism known as EndPoint Projection (EPP). Despite the promise of choreographic programming, there is still a lack of practical evaluations that illustrate the applicability of choreographies to concrete computational problems with standard concurrent solutions. In this work, we explore the potential of choreographies by using Procedural Choreographies (PC), a model that we recently proposed, to write distributed algorithms for sorting (Quicksort), solving linear equations (Gaussian elimination), and computing the Fast Fourier Transform. We discuss the lessons learned from this experiment, giving possible directions for the usage and future improvement of choreography languages.

    Using Verification Technology to Specify and Detect Malware

    Computer viruses and worms are major threats to our computer infrastructure, and thus to the economy and society at large. Recent work has demonstrated that a model-checking-based approach to malware detection can capture the semantics of security exploits more accurately than traditional approaches, and consequently achieve higher detection rates. In this approach, malicious behavior is formalized using the expressive specification language CTPL, which is based on classic CTL. This paper gives an overview of our toolchain for malware detection and presents our new system for computer-assisted generation of malicious code specifications.

    Abstract Interpretation of Indexed Grammars

    Indexed grammars are a generalization of context-free grammars and recognize a proper subset of context-sensitive languages. The languages recognized by indexed grammars are called indexed languages, and they correspond to the languages recognized by nested stack automata. For example, indexed grammars can recognize the language {a^n b^n c^n | n >= 1}, which is not context-free, but they cannot recognize {(ab^n)^n | n >= 1}, which is context-sensitive. Indexed grammars identify a set of languages that are more expressive than context-free languages, while having decidability results that lie in between those of context-free and context-sensitive languages. In this work we study indexed grammars in order to formalize the relation between indexed languages and the other classes of languages in the Chomsky hierarchy. To this end, we provide a fixpoint characterization of the languages recognized by an indexed grammar and we study possible ways to abstract, in the abstract interpretation sense, these languages and their grammars into context-free and regular languages.
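
To make the example language {a^n b^n c^n | n >= 1} concrete, the following is a minimal membership check. It is a direct string comparison, not an indexed grammar or a nested stack automaton, and the function name `is_anbncn` is a hypothetical choice for illustration.

```python
def is_anbncn(word: str) -> bool:
    """Return True if word has the form a^n b^n c^n for some n >= 1."""
    n, remainder = divmod(len(word), 3)
    # The length must be a positive multiple of three.
    if n < 1 or remainder != 0:
        return False
    # Compare against the only word of this length in the language.
    return word == "a" * n + "b" * n + "c" * n


for candidate in ["abc", "aabbcc", "aabbc", "abcabc", ""]:
    print(repr(candidate), is_anbncn(candidate))
```

The check has to relate three counts at once, which is the dependency that a context-free grammar cannot express and that places this language outside the context-free class.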

    Formal Framework for Property-driven Obfuscations

    We study the existence and the characterization of function transformers that minimally or maximally modify a function in order to reveal or conceal a certain property. Based on this general formal framework, we develop a strategy for the design of the maximal obfuscating transformation that conceals a given property while revealing the desired observational behaviour.

    A Categorical Treatment of Malicious Behavioral Obfuscation

    This paper studies malicious behavioral obfuscation through the use of a new abstract model for process and kernel interactions based on monoidal categories. In this model, program observations are considered to be finite lists of system call invocations. In a first step, we show how malicious behaviors can be obfuscated by simulating the observations of benign programs. In a second step, we show how to generate such malicious behaviors through a technique called path replaying, and we extend the class of captured malware by applying algorithmic transformations to the graphical representation of morphisms. In a last step, we show that all the obfuscated versions we obtained can be used to detect well-known malware in practice.